{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Feature Engineering\n",
    "\n",
    "**Learning Objectives**\n",
    "  * Improve the accuracy of a model by using feature engineering\n",
    "  * Understand there's two places to do feature engineering in Tensorflow\n",
    "    1. Using the `tf.feature_column` module\n",
    "    2. In the input functions\n",
    "    \n",
    "## Introduction\n",
    "Up until now we've been focusing on Tensorflow mechanics to make sure our code works, we have neglected model performance, which at this point is **9.26 RMSE**. \n",
    "\n",
    "In this notebook we'll attempt to improve on that using feature engineering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import shutil\n",
    "print(tf.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load raw data\n",
    "\n",
    "These are the same files created in the `create_datasets.ipynb` notebook"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!gcloud storage cp gs://cloud-training-demos/taxifare/small/*.csv .\n",
    "!ls -l *.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train and Evaluate input functions\n",
    "\n",
    "These are the same as before with one additional line of code: a call to `add_engineered_features()` from within the `_parse_row()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CSV_COLUMN_NAMES = [\"fare_amount\",\"dayofweek\",\"hourofday\",\"pickuplon\",\"pickuplat\",\"dropofflon\",\"dropofflat\"]\n",
    "CSV_DEFAULTS = [[0.0],[1],[0],[-74.0],[40.0],[-74.0],[40.7]]\n",
    "\n",
    "def read_dataset(csv_path):\n",
    "    def _parse_row(row):\n",
    "        # Decode the CSV row into list of TF tensors\n",
    "        fields = tf.decode_csv(records = row, record_defaults = CSV_DEFAULTS)\n",
    "\n",
    "        # Pack the result into a dictionary\n",
    "        features = dict(zip(CSV_COLUMN_NAMES, fields))\n",
    "        \n",
    "        # NEW: Add engineered features\n",
    "        features = add_engineered_features(features)\n",
    "        \n",
    "        # Separate the label from the features\n",
    "        label = features.pop(\"fare_amount\") # remove label from features and store\n",
    "\n",
    "        return features, label\n",
    "    \n",
    "    # Create a dataset containing the text lines.\n",
    "    dataset = tf.data.Dataset.list_files(file_pattern = csv_path) # (i.e. data_file_*.csv)\n",
    "    dataset = dataset.flat_map(map_func = lambda filename:tf.data.TextLineDataset(filenames = filename).skip(count = 1))\n",
    "\n",
    "    # Parse each CSV row into correct (features,label) format for Estimator API\n",
    "    dataset = dataset.map(map_func = _parse_row)\n",
    "    \n",
    "    return dataset\n",
    "\n",
    "def train_input_fn(csv_path, batch_size = 128):\n",
    "    #1. Convert CSV into tf.data.Dataset with (features,label) format\n",
    "    dataset = read_dataset(csv_path)\n",
    "      \n",
    "    #2. Shuffle, repeat, and batch the examples.\n",
    "    dataset = dataset.shuffle(buffer_size = 1000).repeat(count = None).batch(batch_size = batch_size)\n",
    "   \n",
    "    return dataset\n",
    "\n",
    "def eval_input_fn(csv_path, batch_size = 128):\n",
    "    #1. Convert CSV into tf.data.Dataset with (features,label) format\n",
    "    dataset = read_dataset(csv_path)\n",
    "\n",
    "    #2.Batch the examples.\n",
    "    dataset = dataset.batch(batch_size = batch_size)\n",
    "   \n",
    "    return dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering: feature columns\n",
    "\n",
    "There are two places in Tensorflow where we can do feature engineering. The first is using the `tf.feature_column` package. This allows us easily \n",
    "\n",
    "- bucketize continuous features\n",
    "- one hot encode categorical features\n",
    "- create feature crosses\n",
    "\n",
    "For details on the possible `tf.feature_column` transformations and when to use each see the [official guide](https://www.tensorflow.org/guide/feature_columns).\n",
    "\n",
    "Let's use `tf.feature_column` to create a feature that shows the combination of day of week and hour of day. This will allow our model to easily learn the difference between say Wednesday at 5pm (rush hour, expect higher fares) and Sunday at 5pm (light traffic, expect lower fares).\n",
    "\n",
    "Let's also use it to bucketize our latitudes and longitudes because treating them as continuous numbers is misleading to the model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **Exercise 1**\n",
    "\n",
    "In the code cell below you are asked to create some additional crossed feature columns for the model. For `fc_crossed_dloc` create a crossed column using the bucketized dropoff latitude and dropoff longitude. For `fc_crossed_ploc` create a crossed column using the pickup latitude and pickup longitude. For `fc_crossed_pd_pair` create a crossed feature column using the pickup location and the dropoff location. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 1. One hot encode dayofweek and hourofday\n",
    "fc_dayofweek = tf.feature_column.categorical_column_with_identity(key = \"dayofweek\", num_buckets = 7)\n",
    "fc_hourofday = tf.feature_column.categorical_column_with_identity(key = \"hourofday\", num_buckets = 24)\n",
    "\n",
    "# 2. Bucketize latitudes and longitudes\n",
    "NBUCKETS = 16\n",
    "latbuckets = np.linspace(start = 38.0, stop = 42.0, num = NBUCKETS).tolist()\n",
    "lonbuckets = np.linspace(start = -76.0, stop = -72.0, num = NBUCKETS).tolist()\n",
    "fc_bucketized_plat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"pickuplon\"), boundaries = lonbuckets)\n",
    "fc_bucketized_plon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"pickuplat\"), boundaries = latbuckets)\n",
    "fc_bucketized_dlat = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"dropofflon\"), boundaries = lonbuckets)\n",
    "fc_bucketized_dlon = tf.feature_column.bucketized_column(source_column = tf.feature_column.numeric_column(key = \"dropofflat\"), boundaries = latbuckets)\n",
    "\n",
    "# 3. Cross features to get combination of day and hour\n",
    "fc_crossed_day_hr = tf.feature_column.crossed_column(keys = [fc_dayofweek, fc_hourofday], hash_bucket_size = 24 * 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Engineering: input functions\n",
    "\n",
    "While feature columns are very powerful, what happens when we want to something that there isn't a feature column for?\n",
    "\n",
    "Recall the input functions recieve csv data, format it, then pass it batch by batch to the model. We can also use input functions to inject arbitrary tensorflow code to manipulate the data.\n",
    "\n",
    "However, we need to be careful that any transformations we do in one input function, we do for all, otherwise we'll have [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml/#training-serving_skew).\n",
    "\n",
    "To guard against this we encapsulate all input function feature engineering in a single function, `add_engineered_features()`, and call this function from every input function.\n",
    "\n",
    "Let's calculate the euclidean distance between the pickup and dropoff points and feed that as a new feature to our model. \n",
    "\n",
    "Also it may be useful to know which cardinal direction that distance is in. I suspect that distance is cheaper to travel North/South because in Manhatten streets that run North/South have less stops than streets that run East/West."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def add_engineered_features(features):\n",
    "    features[\"dayofweek\"] = features[\"dayofweek\"] - 1 # subtract one since our days of week are 1-7 instead of 0-6\n",
    "    \n",
    "    features[\"latdiff\"] = features[\"pickuplat\"] - features[\"dropofflat\"] # East/West\n",
    "    features[\"londiff\"] = features[\"pickuplon\"] - features[\"dropofflon\"] # North/South\n",
    "    features[\"euclidean_dist\"] = tf.sqrt(x = features[\"latdiff\"]**2 + features[\"londiff\"]**2)\n",
    "\n",
    "    return features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gather list of feature columns\n",
    "\n",
    "Ultimately our estimator expects a list of feature columns, so let's gather all our engineered features into a single list.\n",
    "\n",
    "We cannot pass categorical or crossed feature columns directly into a DNN, Tensorflow will give us an error. We must first wrap them using either `indicator_column()` or `embedding_column()`. The former will pass through the one-hot encoded representation as is, the latter will embed the feature into a dense representation of specified dimensionality (the 4th root of the number of categories is a good starting point for number of dimensions). Read more about indicator and embedding columns [here](https://www.tensorflow.org/guide/feature_columns#indicator_and_embedding_columns).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_cols = [\n",
    "    #1. Engineered using tf.feature_column module\n",
    "    tf.feature_column.indicator_column(categorical_column = fc_crossed_day_hr),\n",
    "    fc_bucketized_plat,\n",
    "    fc_bucketized_plon,\n",
    "    fc_bucketized_dlat,\n",
    "    fc_bucketized_dlon,\n",
    "    #2. Engineered in input functions\n",
    "    tf.feature_column.numeric_column(key = \"latdiff\"),\n",
    "    tf.feature_column.numeric_column(key = \"londiff\"),\n",
    "    tf.feature_column.numeric_column(key = \"euclidean_dist\") \n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Serving Input Receiver function \n",
    "\n",
    "Same as before except the received tensors are wrapped with `add_engineered_features()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def serving_input_receiver_fn():\n",
    "    receiver_tensors = {\n",
    "        'dayofweek' : tf.placeholder(dtype = tf.int32, shape = [None]), # shape is vector to allow batch of requests\n",
    "        'hourofday' : tf.placeholder(dtype = tf.int32, shape = [None]),\n",
    "        'pickuplon' : tf.placeholder(dtype = tf.float32, shape = [None]), \n",
    "        'pickuplat' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "        'dropofflat' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "        'dropofflon' : tf.placeholder(dtype = tf.float32, shape = [None]),\n",
    "    }\n",
    "    \n",
    "    features = add_engineered_features(receiver_tensors) # 'features' is what is passed on to the model\n",
    "    \n",
    "    return tf.estimator.export.ServingInputReceiver(features = features, receiver_tensors = receiver_tensors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train and Evaluate (500 train steps)\n",
    "\n",
    "The same as before, we'll train the model for 500 steps (sidenote: how many epochs do 500 trains steps represent?). Let's see how the engineered features we've added affect the performance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "OUTDIR = \"taxi_trained_dnn/500\"\n",
    "shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time\n",
    "tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file\n",
    "tf.logging.set_verbosity(v = tf.logging.INFO) # so loss is printed during training\n",
    "\n",
    "model = tf.estimator.DNNRegressor(\n",
    "    hidden_units = [10,10], # specify neural architecture\n",
    "    feature_columns = feature_cols, \n",
    "    model_dir = OUTDIR,\n",
    "    config = tf.estimator.RunConfig(\n",
    "        tf_random_seed = 1, # for reproducibility\n",
    "        save_checkpoints_steps = 100 # checkpoint every N steps\n",
    "    ) \n",
    ")\n",
    "\n",
    "# Add custom evaluation metric\n",
    "def my_rmse(labels, predictions):\n",
    "    pred_values = tf.squeeze(input = predictions[\"predictions\"], axis = -1)\n",
    "    return {\"rmse\": tf.metrics.root_mean_squared_error(labels = labels, predictions = pred_values)}\n",
    "\n",
    "model = tf.contrib.estimator.add_metrics(estimator = model, metric_fn = my_rmse) \n",
    "    \n",
    "train_spec = tf.estimator.TrainSpec(\n",
    "    input_fn = lambda: train_input_fn(\"./taxi-train.csv\"),\n",
    "    max_steps = 500)\n",
    "\n",
    "exporter = tf.estimator.FinalExporter(name = \"exporter\", serving_input_receiver_fn = serving_input_receiver_fn) # export SavedModel once at the end of training\n",
    "# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint\n",
    "\n",
    "eval_spec = tf.estimator.EvalSpec(\n",
    "    input_fn = lambda: eval_input_fn(\"./taxi-valid.csv\"),\n",
    "    steps = None,\n",
    "    start_delay_secs = 1, # wait at least N seconds before first evaluation (default 120)\n",
    "    throttle_secs = 1, # wait at least N seconds before each subsequent evaluation (default 600)\n",
    "    exporters = exporter) # export SavedModel once at the end of training\n",
    "\n",
    "tf.estimator.train_and_evaluate(estimator = model, train_spec = train_spec, eval_spec = eval_spec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "\n",
    "Our RMSE is now **5.94**, our first significant improvement! If we look at the RMSE trend in TensorBoard it appears the model is still learning, so training past 500 steps would likely lower the RMSE even more. Let's run again, this time for 10x as many steps."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train and Evaluate (5,000 train steps)\n",
    "\n",
    "Now, just as above, we'll execute a longer trianing job with 5,000 train steps using our engineered features and assess the performance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "OUTDIR = \"taxi_trained_dnn/5000\"\n",
    "shutil.rmtree(path = OUTDIR, ignore_errors = True) # start fresh each time\n",
    "tf.summary.FileWriterCache.clear() # ensure filewriter cache is clear for TensorBoard events file\n",
    "tf.logging.set_verbosity(v = tf.logging.INFO) # so loss is printed during training\n",
    "\n",
    "model = tf.estimator.DNNRegressor(\n",
    "    hidden_units = [10,10], # specify neural architecture\n",
    "    feature_columns = feature_cols, \n",
    "    model_dir = OUTDIR,\n",
    "    config = tf.estimator.RunConfig(\n",
    "        tf_random_seed = 1, # for reproducibility\n",
    "        save_checkpoints_steps = 100 # checkpoint every N steps\n",
    "    ) \n",
    ")\n",
    "\n",
    "# Add custom evaluation metric\n",
    "def my_rmse(labels, predictions):\n",
    "    pred_values = tf.squeeze(input = predictions[\"predictions\"], axis = -1)\n",
    "    return {\"rmse\": tf.metrics.root_mean_squared_error(labels = labels, predictions = pred_values)}\n",
    "\n",
    "model = tf.contrib.estimator.add_metrics(estimator = model, metric_fn = my_rmse) \n",
    "    \n",
    "train_spec = tf.estimator.TrainSpec(\n",
    "    input_fn = lambda: train_input_fn(\"./taxi-train.csv\"),\n",
    "    max_steps = 5000)\n",
    "\n",
    "exporter = tf.estimator.FinalExporter(name = \"exporter\", serving_input_receiver_fn = serving_input_receiver_fn) # export SavedModel once at the end of training\n",
    "# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint\n",
    "\n",
    "eval_spec = tf.estimator.EvalSpec(\n",
    "    input_fn = lambda: eval_input_fn(\"./taxi-valid.csv\"),\n",
    "    steps = None,\n",
    "    start_delay_secs = 1, # wait at least N seconds before first evaluation (default 120)\n",
    "    throttle_secs = 1, # wait at least N seconds before each subsequent evaluation (default 600)\n",
    "    exporters = exporter) # export SavedModel once at the end of training\n",
    "\n",
    "tf.estimator.train_and_evaluate(estimator = model, train_spec = train_spec, eval_spec = eval_spec)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "\n",
    "Our RMSE is now **4.13**! It looks like RMSE may still be reducing, but training is getting slow so we should move to the cloud if we want to train longer.\n",
    "\n",
    "Also we haven't explored our hyperparameters much. Is our neural architecture of two layers with 10 nodes each optimal? \n",
    "\n",
    "In the next notebook we'll explore this."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
